perception test
An Approach to Grounding AI Model Evaluations in Human-derived Criteria
In the rapidly evolving field of artificial intelligence (AI), traditional benchmarks often fall short of capturing the nuanced capabilities of AI models. We focus on the case of physical world modeling and propose a novel approach to augment existing benchmarks with human-derived evaluation criteria, aiming to enhance the interpretability and applicability of model behaviors. Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys to identify key cognitive skills, such as Prioritization, Memorizing, Discerning, and Contextualizing, that are critical for both AI and human reasoning. Our findings reveal that participants perceive AI as lacking in interpretive and empathetic skills yet hold high expectations for AI performance. By integrating these insights into benchmark design, we offer a framework for developing more human-aligned means of defining and measuring progress. This work underscores the importance of user-centered evaluation in AI development, providing actionable guidelines for researchers and practitioners aiming to align AI capabilities with human cognitive processes. Our approach both enhances current benchmarking practices and sets the stage for future advancements in AI model evaluation.
- Research Report (1.00)
- Questionnaire & Opinion Survey (0.88)
The Sense of Agency in Assistive Robotics Using Shared Autonomy
Collier, Maggie A., Narayan, Rithika, Admoni, Henny
Sense of agency, a phenomenon from cognitive science that describes the experience of control over one's environment, is one factor that influences people's preferences for robot assistance. However, the assistive robotics literature often adopts paradigms that optimize measures like task success and cognitive load rather than sense of agency. In fact, prior work has found that participants sometimes prefer paradigms, such as direct teleoperation, that perform poorly on those metrics but give the user more control. In this work, we focus on a subset of assistance paradigms for manipulation called shared autonomy, in which the system combines control signals from the user with automated control. We run a study to evaluate sense of agency and show that higher robot autonomy during assistance leads to improved task performance but a decreased sense of agency, indicating a potential trade-off between the two. From our findings, we discuss the relation between sense of agency and optimality, and we consider a proxy metric for a component of sense of agency that might enable us to build systems that monitor and maintain sense of agency in real time.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
- Europe > Switzerland > Fribourg > Fribourg (0.04)
- (2 more...)
- Research Report > Experimental Study (0.94)
- Research Report > New Finding (0.66)
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models
Loginova, Olga, Bezrukov, Oleksandr, Kravets, Alexey
Evaluating Video Language Models (VLMs) is a challenging task. Due to its transparency, Multiple-Choice Question Answering (MCQA) is widely used to measure the performance of these models through accuracy. However, existing MCQA benchmarks fail to capture the full reasoning capabilities of VLMs because of selection bias, whereby models disproportionately favor certain answer options based on positional patterns observed during training. In this work, we conduct a comprehensive empirical analysis of several VLM architectures across major datasets designed to assess complex video-focused reasoning. We identify where the bias is most pronounced and demonstrate to what extent model responses reflect genuine understanding of video content and related questions, as opposed to reliance on arbitrary patterns or superficial cues such as answer position. By decomposing the MCQA task and adapting fairness bias metrics to VLMs, we introduce BOLD, a post-processing calibration technique that balances this bias. Our results show that reducing selection bias improves not only debiasing metrics but also overall model performance, including accuracy and mean F1 score. By suppressing "blind guessing", our method offers a more cost- and time-effective approach to mitigating selection bias than existing techniques. This study represents the first focused investigation of selection bias in video-to-text LLM-powered models.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > Singapore (0.04)
- Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.04)
- Asia > British Indian Ocean Territory > Diego Garcia (0.04)
- Questionnaire & Opinion Survey (1.00)
- Research Report > New Finding (0.85)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
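The abstract above does not spell out the BOLD procedure itself, but the general idea of calibrating away a positional prior can be sketched. Below is a minimal, hypothetical illustration of one common family of post-hoc debiasing: estimate how strongly the model favors each answer position on a calibration set, then divide that prior out of each question's scores. The helper names and numbers are assumptions for illustration, not the paper's method.

```python
# Minimal sketch of positional-prior calibration for MCQA. This illustrates
# the general idea of debiasing by a positional prior; it is NOT the BOLD
# algorithm from the paper, whose exact procedure is not reproduced here.
import numpy as np

def estimate_positional_prior(calibration_scores: np.ndarray) -> np.ndarray:
    """Average per-position probability over a calibration set.

    calibration_scores: (n_questions, n_options) array of softmax scores.
    """
    prior = calibration_scores.mean(axis=0)
    return prior / prior.sum()

def calibrate(scores: np.ndarray, prior: np.ndarray) -> np.ndarray:
    """Divide out the positional prior and renormalize."""
    adjusted = scores / prior
    return adjusted / adjusted.sum()

# Toy example: a model that systematically favors the first option.
calibration = np.array([[0.40, 0.30, 0.20, 0.10],
                        [0.45, 0.25, 0.20, 0.10],
                        [0.50, 0.20, 0.20, 0.10]])
prior = estimate_positional_prior(calibration)
print(calibrate(np.array([0.35, 0.30, 0.20, 0.15]), prior))
```

After calibration, a question where the raw scores merely mirror the positional prior no longer yields a confident first-option answer, which is the "blind guessing" behavior the paper targets.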
TraveLER: A Multi-LMM Agent Framework for Video Question-Answering
Shang, Chuyi, You, Amos, Subramanian, Sanjay, Darrell, Trevor, Herzig, Roei
Recently, Large Multimodal Models (LMMs) have made significant progress in video question-answering by applying large-scale image-based pretraining to individual frames in a zero-shot manner. While such frame-wise methods have shown impressive performance, a current limitation is that they often overlook how key timestamps are selected and cannot adjust when incorrect timestamps are identified. Moreover, they are unable to extract details relevant to the question, instead providing general descriptions of each frame. To overcome this, we design a multi-LMM agent framework that travels along the video, iteratively collecting relevant information from keyframes through interactive question-asking until there is sufficient information to answer the question. Specifically, we propose TraveLER, a model that can create a plan to "Traverse" the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" whether there is enough information to answer the question. Finally, if there is not, our method can "Replan" based on its collected knowledge. Through extensive experiments, we find that TraveLER improves performance on several video question-answering benchmarks, such as NExT-QA, STAR, and Perception Test, without the need to fine-tune on specific datasets.
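As a reading aid, here is a schematic of the Traverse/Locate/Evaluate/Replan loop the abstract describes, written as a minimal Python sketch. The `lmm` object and its methods (`plan_timestamps`, `ask_about_frame`, `can_answer`, `replan`, `answer`) are hypothetical placeholders, not the authors' released API.

```python
# Schematic of the Traverse/Locate/Evaluate/Replan loop described in the
# TraveLER abstract. `lmm` stands in for any large multimodal model; the
# method names below are hypothetical placeholders for illustration.

def traveler(video_frames, question, lmm, max_rounds=5):
    memory = []  # key information collected from frames so far
    # Traverse: plan which timestamps to inspect first.
    plan = lmm.plan_timestamps(question, n_frames=len(video_frames))
    for _ in range(max_rounds):
        for t in plan:
            # Locate: interrogate individual frames and store key information.
            info = lmm.ask_about_frame(video_frames[t], question)
            memory.append((t, info))
        # Evaluate: is the collected evidence sufficient to answer?
        if lmm.can_answer(question, memory):
            return lmm.answer(question, memory)
        # Replan: pick new timestamps based on what has been learned so far.
        plan = lmm.replan(question, memory)
    return lmm.answer(question, memory)  # best effort after max_rounds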
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
Pătrăucean, Viorica, Smaira, Lucas, Gupta, Ankush, Continente, Adrià Recasens, Markeeva, Larisa, Banarse, Dylan, Koppula, Skanda, Heyward, Joseph, Malinowski, Mateusz, Yang, Yi, Doersch, Carl, Matejovicova, Tatiana, Sulsky, Yury, Miech, Antoine, Frechette, Alex, Klimczak, Hanna, Koster, Raphael, Zhang, Junlin, Winkler, Stephanie, Aytar, Yusuf, Osindero, Simon, Damen, Dima, Zisserman, Andrew, Carreira, João
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection, or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, providing a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or limited finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos with an average length of 23 seconds, designed to show perceptually interesting situations and filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), along with a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a substantial gap in performance (91.4% vs 46.2%), suggesting significant room for improvement in multimodal video understanding. Dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_test
- South America > Brazil (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > California > Marin County > Novato (0.04)
- (12 more...)
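Since the fine-tuning and validation splits of the Perception Test are public, a multiple-choice evaluation loop is straightforward to sketch. The JSON field names below (`mc_question`, `options`, `answer_id`) are assumptions about the annotation schema made for illustration; consult https://github.com/deepmind/perception_test for the actual format.

```python
# Minimal sketch of multiple-choice accuracy on a Perception Test-style
# annotation file. The schema (video_id -> mc_question list with options and
# answer_id) is an assumption for illustration, not the verified format.
import json

def evaluate_mc_qa(annotation_path: str, predict) -> float:
    """`predict(video_id, question, options) -> int` returns an option index."""
    with open(annotation_path) as f:
        annotations = json.load(f)
    correct = total = 0
    for video_id, record in annotations.items():
        for q in record.get("mc_question", []):
            pred = predict(video_id, q["question"], q["options"])
            correct += int(pred == q["answer_id"])
            total += 1
    return correct / max(total, 1)

# A blind baseline that always picks the first option would land near chance,
# far below the 91.4% human baseline quoted in the abstract above.
```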
Chinese Learners' Phonetic Transfer of /i/ from Mandarin Chinese to General American English: A Case Study of a Chinese Learner with Advanced English
The current paper concerns language transfer at the phonetic level and concentrates on the transfer phenomenon in an advanced English language learner's acquisition of the English vowel /i/ and its lax counterpart /ɪ/. By determining whether the Chinese English-language learner (ELL), named Vanya, can accurately distinguish between /i/ and /ɪ/ and pronounce them precisely in General American English (GAE), this paper serves as a reference for further studying language transfer among Chinese ELLs. There were two objectives: first, to examine the learner's perceptual ability to distinguish between /i/ and /ɪ/; second, to determine the effect of phonetic transfer. Two perception tests and a production test were used to attain these objectives. The results of the two perception tests demonstrated Vanya's perceptual competence in distinguishing between /i/ and /ɪ/ and laid a solid foundation for the validity of the subsequent production test. Given that the F1 and F2 values of Vanya's /i/ productions were highly similar across his first language (Mandarin Chinese) and second language (GAE), and that both were lower than the typical values for /i/ in GAE, with an especially prominent disparity in F2, it is reasonable to conclude that phonetic transfer occurred. The participant's high perceptual competence as an advanced-level ELL did not noticeably moderate the effect of phonetic transfer.
- Asia > China (0.05)
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
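The abstract's core inference can be made concrete with a toy formant comparison: if the learner's /i/ is nearly identical across L1 and L2 in (F1, F2) space while both sit far from a typical GAE /i/, the distance pattern is consistent with transfer. All values below are invented for illustration and are not the study's measurements.

```python
# Toy illustration of the formant-comparison reasoning in the abstract. A
# small L1-to-L2 distance paired with a large distance from the GAE norm is
# the signature of phonetic transfer. All numbers are hypothetical.
import math

def formant_distance(a, b):
    """Euclidean distance in (F1, F2) space, in Hz."""
    return math.dist(a, b)

learner_l1 = (280, 2100)   # hypothetical Mandarin /i/ (F1, F2)
learner_l2 = (285, 2150)   # hypothetical GAE /i/ by the same learner
typical_gae = (310, 2790)  # hypothetical "typical" GAE /i/ reference

print(formant_distance(learner_l1, learner_l2))   # small: L1 and L2 nearly match
print(formant_distance(learner_l2, typical_gae))  # large: both far from GAE norm
```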