perception test
An Approach to Grounding AI Model Evaluations in Human-derived Criteria
In the rapidly evolving field of artificial intelligence (AI), traditional benchmarks often fall short of capturing the nuanced capabilities of AI models. We focus on the case of physical world modeling and propose a novel approach to augment existing benchmarks with human-derived evaluation criteria, aiming to enhance the interpretability and applicability of model behaviors. Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys to identify key cognitive skills, such as Prioritization, Memorizing, Discerning, and Contextualizing, that are critical for both AI and human reasoning. Our findings reveal that participants perceive AI as lacking in interpretive and empathetic skills yet hold high expectations for AI performance. By integrating these insights into benchmark design, we offer a framework for developing more human-aligned means of defining and measuring progress. This work underscores the importance of user-centered evaluation in AI development, providing actionable guidelines for researchers and practitioners aiming to align AI capabilities with human cognitive processes. Our approach both enhances current benchmarking practices and sets the stage for future advancements in AI model evaluation.
- Research Report (1.00)
- Questionnaire & Opinion Survey (0.88)
The Sense of Agency in Assistive Robotics Using Shared Autonomy
Collier, Maggie A., Narayan, Rithika, Admoni, Henny
Sense of agency, a phenomenon from cognitive science that describes the experience of control over one's environment, is one factor that influences people's preferences for robot assistance. However, the assistive robotics literature often adopts paradigms that optimize measures like task success and cognitive load rather than sense of agency. In fact, prior work has found that participants sometimes prefer paradigms, such as direct teleoperation, that perform poorly on those metrics but give the user more control. In this work, we focus on a subset of assistance paradigms for manipulation called shared autonomy, in which the system combines control signals from the user with automated control. We run a study to evaluate sense of agency and show that higher robot autonomy during assistance leads to improved task performance but a decreased sense of agency, indicating a potential trade-off between the two. From our findings, we discuss the relation between sense of agency and optimality, and we consider a proxy metric for a component of sense of agency that might enable us to build systems that monitor and maintain sense of agency in real time.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
- Europe > Switzerland > Fribourg > Fribourg (0.04)
- (2 more...)
- Research Report > Experimental Study (0.94)
- Research Report > New Finding (0.66)
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models
Loginova, Olga, Bezrukov, Oleksandr, Kravets, Alexey
Evaluating Video Language Models (VLMs) is a challenging task. Due to its transparency, Multiple-Choice Question Answering (MCQA) is widely used to measure the performance of these models through accuracy. However, existing MCQA benchmarks fail to capture the full reasoning capabilities of VLMs because of selection bias, whereby models disproportionately favor certain answer options based on positional patterns observed during training. In this work, we conduct a comprehensive empirical analysis of several VLM architectures across major datasets designed to assess complex video-focused reasoning. We identify where the bias is most pronounced and demonstrate to what extent model responses reflect genuine understanding of video content and related questions, as opposed to reliance on arbitrary patterns or superficial cues such as answer position. By decomposing the MCQA task and adapting fairness bias metrics to VLMs, we introduce BOLD, a post-processing calibration technique that balances this bias. Our results show that reducing selection bias improves not only debiasing metrics but also overall model performance, including accuracy and mean F1 score. By suppressing "blind guessing", our method offers a more cost- and time-effective approach to mitigating selection bias than existing techniques. This study represents the first focused investigation of selection bias in video-to-text LLM-powered models.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > Singapore (0.04)
- Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.04)
- Asia > British Indian Ocean Territory > Diego Garcia (0.04)
- Questionnaire & Opinion Survey (1.00)
- Research Report > New Finding (0.85)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
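The abstract above does not spell out the BOLD procedure itself, but the general idea of calibrating away a positional prior can be sketched. Below is a minimal, hypothetical illustration of one common family of post-hoc debiasing: estimate how strongly the model favors each answer position on a calibration set, then divide that prior out of each question's scores. The helper names and numbers are assumptions for illustration, not the paper's method.

```python
# Minimal sketch of positional-prior calibration for MCQA. This illustrates
# the general idea of debiasing by a positional prior; it is NOT the BOLD
# algorithm from the paper, whose exact procedure is not reproduced here.
import numpy as np

def estimate_positional_prior(calibration_scores: np.ndarray) -> np.ndarray:
    """Average per-position probability over a calibration set.

    calibration_scores: (n_questions, n_options) array of softmax scores.
    """
    prior = calibration_scores.mean(axis=0)
    return prior / prior.sum()

def calibrate(scores: np.ndarray, prior: np.ndarray) -> np.ndarray:
    """Divide out the positional prior and renormalize."""
    adjusted = scores / prior
    return adjusted / adjusted.sum()

# Toy example: a model that systematically favors the first option.
calibration = np.array([[0.40, 0.30, 0.20, 0.10],
                        [0.45, 0.25, 0.20, 0.10],
                        [0.50, 0.20, 0.20, 0.10]])
prior = estimate_positional_prior(calibration)
print(calibrate(np.array([0.35, 0.30, 0.20, 0.15]), prior))
```

After calibration, a question where the raw scores merely mirror the positional prior no longer yields a confident first-option answer, which is the "blind guessing" behavior the paper targets.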
TraveLER: A Multi-LMM Agent Framework for Video Question-Answering
Shang, Chuyi, You, Amos, Subramanian, Sanjay, Darrell, Trevor, Herzig, Roei
Recently, Large Multimodal Models (LMMs) have made significant progress in video question-answering by applying large-scale image-based pretraining to individual frames in a zero-shot manner. While such frame-wise methods have shown impressive performance, a current limitation is that they often overlook how key timestamps are selected and cannot adjust when incorrect timestamps are identified. Moreover, they are unable to extract details relevant to the question, instead providing general descriptions of each frame. To overcome this, we design a multi-LMM agent framework that travels along the video, iteratively collecting relevant information from keyframes through interactive question-asking until there is sufficient information to answer the question. Specifically, we propose TraveLER, a model that can create a plan to "Traverse" the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" whether there is enough information to answer the question. Finally, if there is not, our method can "Replan" based on its collected knowledge. Through extensive experiments, we find that TraveLER improves performance on several video question-answering benchmarks, such as NExT-QA, STAR, and Perception Test, without the need to fine-tune on specific datasets.
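As a reading aid, here is a schematic of the Traverse/Locate/Evaluate/Replan loop the abstract describes, written as a minimal Python sketch. The `lmm` object and its methods (`plan_timestamps`, `ask_about_frame`, `can_answer`, `replan`, `answer`) are hypothetical placeholders, not the authors' released API.

```python
# Schematic of the Traverse/Locate/Evaluate/Replan loop described in the
# TraveLER abstract. `lmm` stands in for any large multimodal model; the
# method names below are hypothetical placeholders for illustration.

def traveler(video_frames, question, lmm, max_rounds=5):
    memory = []  # key information collected from frames so far
    # Traverse: plan which timestamps to inspect first.
    plan = lmm.plan_timestamps(question, n_frames=len(video_frames))
    for _ in range(max_rounds):
        for t in plan:
            # Locate: interrogate individual frames and store key information.
            info = lmm.ask_about_frame(video_frames[t], question)
            memory.append((t, info))
        # Evaluate: is the collected evidence sufficient to answer?
        if lmm.can_answer(question, memory):
            return lmm.answer(question, memory)
        # Replan: pick new timestamps based on what has been learned so far.
        plan = lmm.replan(question, memory)
    return lmm.answer(question, memory)  # best effort after max_rounds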
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
Pătrăucean, Viorica, Smaira, Lucas, Gupta, Ankush, Continente, Adrià Recasens, Markeeva, Larisa, Banarse, Dylan, Koppula, Skanda, Heyward, Joseph, Malinowski, Mateusz, Yang, Yi, Doersch, Carl, Matejovicova, Tatiana, Sulsky, Yury, Miech, Antoine, Frechette, Alex, Klimczak, Hanna, Koster, Raphael, Zhang, Junlin, Winkler, Stephanie, Aytar, Yusuf, Osindero, Simon, Damen, Dima, Zisserman, Andrew, Carreira, João
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection, or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, providing a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or limited finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos with an average length of 23 seconds, designed to show perceptually interesting situations and filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), along with a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a substantial gap in performance (91.4% vs 46.2%), suggesting significant room for improvement in multimodal video understanding. Dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_test
- South America > Brazil (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > California > Marin County > Novato (0.04)
- (12 more...)
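Since the fine-tuning and validation splits of the Perception Test are public, a multiple-choice evaluation loop is straightforward to sketch. The JSON field names below (`mc_question`, `options`, `answer_id`) are assumptions about the annotation schema made for illustration; consult https://github.com/deepmind/perception_test for the actual format.

```python
# Minimal sketch of multiple-choice accuracy on a Perception Test-style
# annotation file. The schema (video_id -> mc_question list with options and
# answer_id) is an assumption for illustration, not the verified format.
import json

def evaluate_mc_qa(annotation_path: str, predict) -> float:
    """`predict(video_id, question, options) -> int` returns an option index."""
    with open(annotation_path) as f:
        annotations = json.load(f)
    correct = total = 0
    for video_id, record in annotations.items():
        for q in record.get("mc_question", []):
            pred = predict(video_id, q["question"], q["options"])
            correct += int(pred == q["answer_id"])
            total += 1
    return correct / max(total, 1)

# A blind baseline that always picks the first option would land near chance,
# far below the 91.4% human baseline quoted in the abstract above.
```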
Chinese Learners' Phonetic Transfer of /i/ from Mandarin Chinese to General American English: A Case Study of a Chinese Learner with Advanced English
The current paper concerns language transfer at the phonetic level and concentrates on the transfer phenomenon in an advanced English language learner's acquisition of the English vowel /i/ and its lax counterpart /ɪ/. By determining whether the Chinese English-language learner (ELL), named Vanya, can accurately distinguish between /i/ and /ɪ/ and pronounce them precisely in General American English (GAE), this paper serves as a reference for further studying language transfer among Chinese ELLs. There were two objectives: first, to examine the learner's perceptual ability to distinguish between /i/ and /ɪ/; second, to determine the effect of phonetic transfer. Two perception tests and a production test were used to attain these objectives. The results of the two perception tests demonstrated Vanya's perceptual competence in distinguishing between /i/ and /ɪ/ and laid a solid foundation for the validity of the subsequent production test. Given that the F1 and F2 values of Vanya's /i/ productions were highly similar across his first language (Mandarin Chinese) and second language (GAE), and that both were lower than the typical values for /i/ in GAE, with an especially prominent disparity in F2, it is reasonable to conclude that phonetic transfer occurred. The participant's high perceptual competence as an advanced-level ELL did not noticeably moderate the effect of phonetic transfer.
- Asia > China (0.05)
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
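The abstract's core inference can be made concrete with a toy formant comparison: if the learner's /i/ is nearly identical across L1 and L2 in (F1, F2) space while both sit far from a typical GAE /i/, the distance pattern is consistent with transfer. All values below are invented for illustration and are not the study's measurements.

```python
# Toy illustration of the formant-comparison reasoning in the abstract. A
# small L1-to-L2 distance paired with a large distance from the GAE norm is
# the signature of phonetic transfer. All numbers are hypothetical.
import math

def formant_distance(a, b):
    """Euclidean distance in (F1, F2) space, in Hz."""
    return math.dist(a, b)

learner_l1 = (280, 2100)   # hypothetical Mandarin /i/ (F1, F2)
learner_l2 = (285, 2150)   # hypothetical GAE /i/ by the same learner
typical_gae = (310, 2790)  # hypothetical "typical" GAE /i/ reference

print(formant_distance(learner_l1, learner_l2))   # small: L1 and L2 nearly match
print(formant_distance(learner_l2, typical_gae))  # large: both far from GAE norm
```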