AITopics | Nagrani, Arsha

Nagrani, Arsha

Neptune: The Long Orbit to Benchmarking Long Video Understanding

Nagrani, Arsha, Zhang, Mingda, Mehran, Ramin, Hornung, Rachel, Gundavarapu, Nitesh Bharadwaj, Jha, Nilpa, Myers, Austin, Zhou, Xingyi, Gong, Boqing, Schmid, Cordelia, Sirotenko, Mikhail, Zhu, Yukun, Weyand, Tobias

arXiv.org Artificial Intelligence, Dec-12-2024

This paper describes a semi-automatic pipeline to generate challenging question-answer-decoy sets for understanding long videos. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. To mitigate both these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs) to automatically generate dense, time-aligned video captions, as well as tough question-answer-decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and includes a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open source model-based metric GEM to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at https://github.com/google-deepmind/neptune

large language model, machine learning, natural language

arXiv:2412.09582

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Sports (1.00)
Health & Medicine (0.93)
Media > Film (0.92)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

The VoxCeleb Speaker Recognition Challenge: A Retrospective

Huh, Jaesung, Chung, Joon Son, Nagrani, Arsha, Brown, Andrew, Jung, Jee-weon, Garcia-Romero, Daniel, Zisserman, Andrew

arXiv.org Artificial Intelligence, Aug-27-2024

The VoxCeleb Speaker Recognition Challenges (VoxSRC) were a series of challenges and workshops that ran annually from 2019 to 2023. The challenges primarily evaluated the tasks of speaker recognition and diarisation under various settings, including closed and open training data, as well as supervised, self-supervised, and semi-supervised training for domain adaptation. The challenges also provided publicly available training and evaluation datasets for each task and setting, with new test sets released each year. In this paper, we provide a review of these challenges that covers what they explored, the methods developed by the challenge participants and how these evolved, and the current state of the field for speaker verification and diarisation. We chart the progress in performance over the five installments of the challenge on a common evaluation dataset and provide a detailed analysis of how each year's special focus affected participants' performance. This paper is aimed both at researchers who want an overview of the speaker recognition and diarisation field, and at challenge organisers who want to benefit from the successes and avoid the mistakes of the VoxSRC challenges. We end with a discussion of the current strengths of the field and open challenges. Project page: https://mm.kaist.ac.kr/datasets/voxceleb/voxsrc/workshop.html

artificial intelligence, machine learning, pattern recognition

doi: 10.1109/TASLP.2024.3444456

arXiv:2408.14886

Country:

North America > United States (1.00)
Asia (1.00)
Europe > United Kingdom > England > Oxfordshire (0.28)

Genre:

Overview (1.00)
Research Report > Experimental Study (0.46)

Industry:

Information Technology > Security & Privacy (0.92)
Media (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.91)

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Min, Juhong, Buch, Shyamal, Nagrani, Arsha, Cho, Minsu, Schmid, Cordelia

arXiv.org Artificial Intelligence, Apr-9-2024

This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).

large language model, machine learning, question answering

arXiv:2404.06511

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Video Summarization: Towards Entity-Aware Captions

Ayyubi, Hammad A., Liu, Tianqi, Nagrani, Arsha, Lin, Xudong, Zhang, Mingda, Arnab, Anurag, Han, Feng, Zhu, Yukun, Liu, Jialu, Chang, Shih-Fu

arXiv.org Artificial Intelligence, Dec-1-2023

Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. As such, we propose the task of summarizing news video directly to entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to an existing news image captioning dataset. With these extensive experiments and insights, we believe we establish a solid basis for future research on this challenging task.

caption, large language model, machine learning

arXiv:2312.02188

Country:

Europe (1.00)
Asia > Middle East > Iraq (1.00)
Africa (0.93)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Government > Military (1.00)
Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)

VidChapters-7M: Video Chapters at Scale

Yang, Antoine, Nagrani, Arsha, Laptev, Ivan, Sivic, Josef, Schmid, Cordelia

arXiv.org Artificial Intelligence, Sep-25-2023

Segmenting long videos into chapters enables users to quickly navigate to the information they are interested in. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters, and hence without any additional manual annotation. We introduce the following three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models for these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, largely improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset. Our dataset, code, and models are publicly available at https://antoyang.github.io/vidchapters.html.

artificial intelligence, machine learning, natural language

arXiv:2309.13952

Country: Europe (0.28)

Genre: Research Report > New Finding (0.67)

Industry:

Law (1.00)
Information Technology (0.93)
Government (0.92)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

LanSER: Language-Model Supported Speech Emotion Recognition

Gong, Taesik, Belanich, Josh, Somandepalli, Krishna, Nagrani, Arsha, Eoff, Brian, Jou, Brendan

arXiv.org Artificial Intelligence, Sep-7-2023

Speech emotion recognition (SER) models typically rely on costly human-labeled data for training, making scaling methods to large speech datasets and nuanced emotion taxonomies difficult. We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models through weakly-supervised learning. For inferring weak labels constrained to a taxonomy, we use a textual entailment approach that selects an emotion label with the highest entailment score for a speech transcript extracted via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and show improved label efficiency. Despite being pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech.
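The label-selection step described in this abstract can be sketched as an argmax over entailment scores. This is a minimal illustration, not the authors' implementation: the function names, the hypothesis template, and the toy keyword scorer are all assumptions made here. In practice `entail_score` would be backed by a pre-trained entailment model applied to an ASR transcript.

```python
from typing import Callable, Sequence


def pick_weak_label(
    transcript: str,
    taxonomy: Sequence[str],
    entail_score: Callable[[str, str], float],
    template: str = "This person is feeling {}.",  # hypothetical prompt wording
) -> str:
    """Return the taxonomy label whose hypothesis scores highest against the transcript."""
    if not taxonomy:
        raise ValueError("taxonomy must not be empty")
    return max(taxonomy, key=lambda label: entail_score(transcript, template.format(label)))


def toy_entail_score(premise: str, hypothesis: str) -> float:
    """Stand-in scorer (assumption): counts cue words instead of running a real model."""
    cues = {
        "happy": {"great", "wonderful"},
        "sad": {"miss", "lost"},
        "angry": {"furious", "unfair"},
    }
    words = set(premise.lower().replace(".", " ").replace(",", " ").split())
    for label, cue_words in cues.items():
        if label in hypothesis:
            return float(len(words & cue_words))
    return 0.0


weak_label = pick_weak_label(
    "I lost my keys and I miss my old flat.",
    ["happy", "sad", "angry"],
    toy_entail_score,
)
```

Constraining the choice to the taxonomy passed in is what lets the same scorer serve different emotion label sets without retraining.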

artificial intelligence, machine learning, natural language

doi: 10.21437/Interspeech.2023-1832

arXiv:2309.03978

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.60)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.53)

UnLoc: A Unified Framework for Video Localization Tasks

Yan, Shen, Xiong, Xuehan, Nagrani, Arsha, Arnab, Anurag, Wang, Zhonghao, Ge, Weina, Ross, David, Schmid, Cordelia

arXiv.org Artificial Intelligence, Aug-21-2023

While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task. We design a new approach for this called UnLoc, which uses pretrained image and text towers, and feeds tokens to a video-text fusion model. The outputs of the fusion module are then used to construct a feature pyramid in which each level connects to a head to predict a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables Moment Retrieval, Temporal Localization, and Action Segmentation with a single-stage model, without the need for action proposals, motion-based pretrained features or representation masking. Unlike specialized models, we achieve state-of-the-art results on all three different localization tasks with a unified approach. Code will be available at: https://github.com/google-research/scenic.
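To make the head outputs concrete, the sketch below turns per-frame relevancy scores and start/end displacements into candidate segments. It is an illustrative decoding under stated assumptions, not the UnLoc code: the function name, the threshold, and the use of frame indices as time units are all hypothetical, and steps such as merging predictions across pyramid levels or non-maximum suppression are omitted.

```python
def decode_segments(scores, start_disp, end_disp, threshold=0.5):
    """Convert per-frame predictions into (start, end, score) candidates.

    scores[t]      relevancy of frame t (assumed to lie in [0, 1])
    start_disp[t]  predicted distance from frame t back to the segment start
    end_disp[t]    predicted distance from frame t forward to the segment end
    """
    segments = []
    for t, (score, d_start, d_end) in enumerate(zip(scores, start_disp, end_disp)):
        if score >= threshold:
            # Clamp the start so a segment never begins before the video does.
            segments.append((max(0.0, t - d_start), t + d_end, score))
    # Highest-scoring candidates first.
    return sorted(segments, key=lambda seg: seg[2], reverse=True)


candidates = decode_segments(
    scores=[0.1, 0.9, 0.6, 0.2],
    start_disp=[0.0, 1.0, 2.0, 0.0],
    end_disp=[0.0, 2.0, 1.0, 0.0],
)
```

Here frames 1 and 2 both pass the threshold and both point at the same span, which is the kind of redundancy a later suppression step would collapse.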

artificial intelligence, machine learning, natural language

arXiv:2308.11062

Country: Asia (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.34)

Modular Visual Question Answering via Code Generation

Subramanian, Sanjay, Narasimhan, Medhini, Khangaonkar, Kushal, Yang, Kevin, Nagrani, Arsha, Schmid, Cordelia, Zeng, Andy, Darrell, Trevor, Klein, Dan

arXiv.org Artificial Intelligence, Jun-8-2023

We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning. The generated Python programs invoke and compose the outputs of the visual models using arithmetic and conditional logic. Our approach improves accuracy on the COVR dataset by at least 3% and on the GQA dataset by roughly 2% compared to the few-shot baseline that does not employ code generation.
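As a rough picture of what "composing the outputs of the visual models using arithmetic and conditional logic" can look like, here is a hand-written example of the kind of program a language model might generate for a multi-image counting question. Everything in it is an assumption for illustration: `query` is a stub that reads canned answers from a dict rather than calling a visual model, and the question wording is invented.

```python
def query(image, question):
    """Hypothetical visual primitive: answer a simple question about one image.

    Stubbed here as a dict lookup; a real system would call a pre-trained
    visual model instead.
    """
    return image.get(question, "0")


def generated_program(images):
    """Program of the sort an LM might emit for:
    'Are there more dogs than cats across all images?'"""
    dogs = 0
    cats = 0
    for image in images:
        # Arithmetic over per-image answers from the visual primitive.
        dogs += int(query(image, "How many dogs are there?"))
        cats += int(query(image, "How many cats are there?"))
    # Conditional logic produces the final answer.
    return "yes" if dogs > cats else "no"


images = [
    {"How many dogs are there?": "2", "How many cats are there?": "1"},
    {"How many dogs are there?": "0", "How many cats are there?": "3"},
]
answer = generated_program(images)
```

The visual model only ever sees simple single-image questions; the counting and comparison live in ordinary Python, which is what makes the intermediate steps inspectable.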

artificial intelligence, natural language, question answering

arXiv:2306.05392

Country:

North America > United States (0.46)
Europe (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.81)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)

PaLI-X: On Scaling up a Multilingual Vision and Language Model

Chen, Xi, Djolonga, Josip, Padlewski, Piotr, Mustafa, Basil, Changpinyo, Soravit, Wu, Jialin, Ruiz, Carlos Riquelme, Goodman, Sebastian, Wang, Xiao, Tay, Yi, Shakeri, Siamak, Dehghani, Mostafa, Salz, Daniel, Lucic, Mario, Tschannen, Michael, Nagrani, Arsha, Hu, Hexiang, Joshi, Mandar, Pang, Bo, Montgomery, Ceslee, Pietrzyk, Paulina, Ritter, Marvin, Piergiovanni, AJ, Minderer, Matthias, Pavetic, Filip, Waters, Austin, Li, Gang, Alabdulmohsin, Ibrahim, Beyer, Lucas, Amelot, Julien, Lee, Kenton, Steiner, Andreas Peter, Li, Yang, Keysers, Daniel, Arnab, Anurag, Xu, Yuanzhong, Rong, Keran, Kolesnikov, Alexander, Seyedhosseini, Mojtaba, Angelova, Anelia, Zhai, Xiaohua, Houlsby, Neil, Soricut, Radu

arXiv.org Artificial Intelligence, May-29-2023

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state of the art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.

machine learning, natural language, question answering

arXiv:2305.18565

Country: North America > United States > Louisiana (0.14)

Genre: Research Report (1.00)

Industry:

Health & Medicine (0.92)
Media (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.55)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.46)

Verbs in Action: Improving verb understanding in video-language models

Momeni, Liliane, Caron, Mathilde, Nagrani, Arsha, Zisserman, Andrew, Schmid, Cordelia

arXiv.org Artificial Intelligence, Apr-13-2023

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification. To the best of our knowledge, this is the first work which proposes a method to alleviate the verb understanding problem, and does not simply highlight it.
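The role of hard negatives in component (1) can be illustrated with a generic InfoNCE-style contrastive loss for a single video. This is a sketch of the general technique, not the VFC formulation: the function name, the temperature value, and the similarity numbers are assumptions, and the paper's calibration strategy and verb phrase alignment loss are not modelled. A hard negative here would be a caption that differs from the true one mainly in its verb, for example "a person closes a door" against "a person opens a door".

```python
import math


def contrastive_loss_with_hard_negatives(pos_sim, neg_sims, temperature=0.07):
    """InfoNCE-style loss for one video.

    pos_sim   similarity between the video and its true caption
    neg_sims  similarities between the video and negative captions,
              including any hard negatives
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Equals -log(softmax probability assigned to the positive caption).
    return log_denominator - logits[0]


# A negative that scores nearly as high as the positive is penalised far more
# than an easy one, which is what pushes the model to attend to the verb.
loss_easy = contrastive_loss_with_hard_negatives(0.8, [0.1])
loss_hard = contrastive_loss_with_hard_negatives(0.8, [0.75])
```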

artificial intelligence, caption, natural language

arXiv:2304.06708

Genre: Research Report (0.63)

Industry:

Leisure & Entertainment > Sports (1.00)
Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
