ad generation
MCAD: Multimodal Context-Aware Audio Description Generation For Soccer
Chaudhary, Lipisha; Mittal, Trisha; Gopalakrishnan, Subhadra; Nwogu, Ifeoma; Pytlarz, Jaclyn
Audio Descriptions (AD) are essential for making visual content accessible to individuals with visual impairments. Recent work has taken promising steps toward automating AD, but it has been limited to describing high-quality movie content and has relied on human-annotated ground-truth AD in the process. In this work, we present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports, with a focus on soccer games, without relying on ground-truth AD. To address the absence of domain-specific AD datasets, we fine-tune a Video Large Language Model (Video-LLM) on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. During inference, MCAD incorporates multimodal contextual cues such as player identities, soccer events/actions, and commentary from the game. These cues, combined with input prompts to the fine-tuned Video-LLM, allow the system to produce complete AD text for each video segment. We further introduce a new evaluation metric, ARGE-AD, designed to assess the quality of generated AD. ARGE-AD evaluates the generated AD for the presence of five characteristics: (i) usage of people's names, (ii) mention of actions/events, (iii) appropriate length of AD, (iv) absence of pronouns, and (v) overlap from commentary/subtitles. We present an in-depth analysis of our approach on both movie and soccer datasets. We also validate the metric as a quantitative measure of generated AD quality across domains. Additionally, we contribute audio descriptions for 100 soccer game clips annotated by two AD experts. Audio Description (AD) is the descriptive spoken narration of visual content, primarily intended to help people with visual impairments access that content [1].
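The five ARGE-AD characteristics listed in the MCAD abstract can be illustrated with a toy, rule-based checklist. This is only a sketch under stated assumptions, not the paper's implementation: the abstract does not say how each characteristic is measured, so the pronoun list, the word-count bounds, the token-overlap ratio, and every name below (`ADContext`, `check_ad`) are hypothetical. The commentary overlap is reported as a raw fraction because the abstract does not state how it is scored.

```python
import re
from dataclasses import dataclass

# Hypothetical pronoun list; the abstract does not specify one.
PRONOUNS = {"he", "she", "him", "her", "his", "hers",
            "they", "them", "their", "theirs", "it", "its"}


@dataclass
class ADContext:
    """Contextual cues for one video segment (illustrative structure)."""
    names: list       # player names known for the segment
    events: list      # event/action keywords detected for the segment
    commentary: str   # commentary or subtitle text for the segment


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def check_ad(ad, ctx, min_words=5, max_words=25):
    """Return one entry per characteristic named in the abstract.

    The word-count bounds are assumptions chosen for illustration.
    """
    words = tokenize(ad)
    lowered = ad.lower()
    commentary_words = set(tokenize(ctx.commentary))
    # Ignore very short words so function words do not dominate the overlap.
    content = [w for w in words if len(w) > 3]
    shared = sum(w in commentary_words for w in content)
    overlap = shared / len(content) if content else 0.0
    return {
        "uses_names": any(n.lower() in lowered for n in ctx.names),
        "mentions_event": any(e.lower() in lowered for e in ctx.events),
        "length_ok": min_words <= len(words) <= max_words,
        "no_pronouns": not any(w in PRONOUNS for w in words),
        "commentary_overlap": overlap,
    }


ctx = ADContext(
    names=["Messi"],
    events=["scores", "foul"],
    commentary="Messi scores! What a strike from the edge of the box.",
)
good = check_ad("Messi shoots from the edge of the box and scores.", ctx)
weak = check_ad("He passes it.", ctx)
```

In this example, `good` satisfies the four boolean checks, while `weak` fails all of them because it is short, names no player or event, and relies on pronouns.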
Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies
Gao, Yingqiang; Fischer, Lukas; Lintner, Alexa; Ebling, Sarah
Audio descriptions (ADs) function as acoustic commentaries designed to assist blind persons and persons with visual impairments in accessing digital media content on television and in movies, among other settings. As an accessibility service typically provided by trained AD professionals, the generation of ADs demands significant human effort, making the process both time-consuming and costly. Recent advancements in natural language processing (NLP) and computer vision (CV), particularly in large language models (LLMs) and vision-language models (VLMs), have brought automatic AD generation a step closer. This paper reviews the technologies pertinent to AD generation in the era of LLMs and VLMs: we discuss how state-of-the-art NLP and CV technologies can be applied to generate ADs and identify essential research directions for the future.
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
Zhang, Chaoyi; Lin, Kevin; Yang, Zhengyuan; Wang, Jianfeng; Li, Linjie; Lin, Chung-Ching; Liu, Zicheng; Wang, Lijuan
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focused on downstream fine-tuning with short video clips, MM-Narrator excels at generating precise audio descriptions for long videos, even those running beyond hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which effectively utilizes both the short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, so that the generated audio descriptions stay story-coherent and character-centric. Maintaining the training-free design of MM-Narrator, we further propose a complexity-based demonstration selection strategy to substantially enhance its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both the existing fine-tuning-based approaches and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4, this evaluator comprehensively reasons about and scores AD generation performance along various extendable dimensions.
Master AI Tools List
The Master AI Tool List was created to share a comprehensive list of sites related to Artificial Intelligence and Machine Learning. We want to make this the source for discovering and sharing the latest AI tools. If you know of an AI tool that is not currently listed, use the submission form to request a review. Together we can make this an important resource for everyone. Ai Sofiya is an AI tool that can create social media ads in under a minute.