Caption Generator


Weakly Supervised Dense Event Captioning in Videos

Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, Junzhou Huang

Neural Information Processing Systems

Among the wide variety of applications in video understanding, the video captioning task has attracted increasing interest in recent years [4, 5, 6, 7, 8, 9, 10, 11].




LangDA: Building Context-Awareness via Language for Domain Adaptive Semantic Segmentation

Liu, Chang, Balaji, Bavesh, Hossain, Saad, Thomas, C, Lai, Kwei-Herng, Vemulapalli, Raviteja, Wong, Alexander, Rambhatla, Sirisha

arXiv.org Machine Learning

Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label-rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision-only approaches using masking or multi-resolution crops, and (2) language-based approaches that use generic class-wise prompts informed by the target domain (e.g., "a {snowy} photo of a {class}"). However, the former is susceptible to noisy pseudo-labels that are biased toward the source domain, while the latter does not fully capture the intricate spatial relationships of objects, which are key for dense prediction tasks. To this end, we propose LangDA. LangDA addresses these challenges by, first, learning contextual relationships between objects via VLM-generated scene descriptions (e.g., "a pedestrian is on the sidewalk, and the street is lined with buildings."). Second, LangDA aligns the features of the entire image with the text representation of this context-aware scene caption and learns generalized representations via text. With this, LangDA sets a new state of the art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4%, and 3.9%.
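
A minimal sketch of the alignment idea described in the abstract, assuming a toy image encoder and a precomputed caption embedding rather than the paper's actual VLM text encoder and segmentation backbone:

```python
# Illustrative image-text alignment, not the authors' exact LangDA objective.
import torch
import torch.nn.functional as F

class ToyImageEncoder(torch.nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, dim, kernel_size=3, padding=1)

    def forward(self, x):
        # Global average pool to a single image-level feature vector.
        return self.conv(x).mean(dim=(2, 3))

def alignment_loss(image_feat, caption_emb):
    # Pull the whole-image feature toward the caption embedding (cosine distance).
    image_feat = F.normalize(image_feat, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    return 1.0 - (image_feat * caption_emb).sum(dim=-1).mean()

encoder = ToyImageEncoder()
images = torch.randn(4, 3, 64, 64)     # target-domain batch (random stand-in)
caption_emb = torch.randn(4, 256)      # embedding of the VLM scene caption (stand-in)
loss = alignment_loss(encoder(images), caption_emb)
loss.backward()
```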


Reviews: Turbo Learning for CaptionBot and DrawingBot

Neural Information Processing Systems

Summary: This paper proposed a joint approach for learning two networks: a captionbot that generates a caption given an image, and a drawingbot that generates an image given a caption. For both the caption and image generators, the authors use existing network architectures. An LSTM-based network that incorporates an image feature produced by a ResNet is used for caption generation (the specific architecture is not clearly described). An attention GAN is used to generate an image from a caption. The main contribution of this paper is the joint training of the caption and image generators by constructing two auto-encoders. An image auto-encoder consists of a caption generator feeding an image generator.
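
To make the auto-encoder construction concrete, here is a toy sketch of the image auto-encoder loop (a caption generator feeding an image generator, trained on reconstruction); both modules are placeholders, not the paper's ResNet+LSTM captioner or attention GAN:

```python
# Toy "image auto-encoder": image -> caption representation -> reconstructed image.
import torch

class ToyCaptionGenerator(torch.nn.Module):      # image -> "caption" embedding
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Flatten(),
                                       torch.nn.Linear(3 * 32 * 32, 128))

    def forward(self, img):
        return self.net(img)

class ToyImageGenerator(torch.nn.Module):        # "caption" embedding -> image
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(128, 3 * 32 * 32)

    def forward(self, cap):
        return self.net(cap).view(-1, 3, 32, 32)

captioner, drawer = ToyCaptionGenerator(), ToyImageGenerator()
images = torch.randn(8, 3, 32, 32)
reconstruction = drawer(captioner(images))
# Joint training signal: how well the caption preserved the image content.
loss = torch.nn.functional.mse_loss(reconstruction, images)
loss.backward()
```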


Infusing Environmental Captions for Long-Form Video Language Grounding

Lee, Hyogun, Hong, Soyeon, Sung, Mujeen, Choi, Jinwoo

arXiv.org Artificial Intelligence

In this work, we tackle the problem of long-form video-language grounding (VLG). Given a long-form video and a natural language query, a model should temporally localize the precise moment that answers the query. Humans can easily solve VLG tasks, even with arbitrarily long videos, by discarding irrelevant moments using extensive and robust knowledge gained from experience. Unlike humans, existing VLG methods are prone to relying on superficial cues learned from small-scale datasets, even when those cues appear in irrelevant frames. To overcome this challenge, we propose EI-VLG, a VLG method that leverages richer textual information provided by a Multi-modal Large Language Model (MLLM) as a proxy for human experience, helping to effectively exclude irrelevant frames. We validate the effectiveness of the proposed method via extensive experiments on the challenging EgoNLQ benchmark.
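
As a rough illustration of excluding irrelevant frames with auxiliary text, the sketch below scores assumed per-frame descriptions (standing in for MLLM output) against the query using simple word overlap; the scoring rule and threshold are illustrative, not the mechanism EI-VLG actually uses:

```python
# Keep only frames whose (assumed) textual description is relevant to the query.
def relevance(description: str, query: str) -> float:
    d, q = set(description.lower().split()), set(query.lower().split())
    return len(d & q) / max(len(q), 1)

frame_captions = {
    0: "a person chops onions on a kitchen counter",
    1: "an empty hallway with a closed door",
    2: "the person places onions into a frying pan",
}
query = "when did I put the onions in the pan"

kept = [t for t, cap in frame_captions.items() if relevance(cap, query) > 0.2]
print(kept)  # frame indices considered relevant to the query
```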


Explainable Image Captioning using CNN- CNN architecture and Hierarchical Attention

Mohan, Rishi Kesav, Sureshkumar, Sanjay, Sivasubramaniam, Vignesh

arXiv.org Artificial Intelligence

Image captioning is a technology that produces text-based descriptions for an image. Deep learning-based solutions built on top of feature recognition may very well serve the purpose. But as with many other machine learning solutions, user understanding of the caption generation process is poor: the model does not provide any explanation for its predictions, and hence conventional methods are also referred to as black-box methods. Thus, an approach in which the model's predictions can be trusted by the user is needed to achieve interpretability. Explainable AI approaches a conventional method in a way that makes the model's or algorithm's predictions explainable and justifiable. Thus, this article approaches image captioning using Explainable AI such that the resulting captions generated by the model can be explained and visualized. A newer architecture with a CNN decoder and a hierarchical attention concept is used to increase the speed and accuracy of caption generation. Also, incorporating explainability into a model makes it more trustworthy when used in an application. The model is trained and evaluated using the MSCOCO dataset, and both quantitative and qualitative results are presented in this article.
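
To illustrate how attention weights can serve as a visual explanation, the following sketch computes a single soft-attention step over a 7x7 grid of region features; it is an assumption-laden toy example, not the paper's CNN-decoder or hierarchical-attention architecture:

```python
# One attention step whose weights can be rendered as an explanation heatmap.
import torch
import torch.nn.functional as F

regions = torch.randn(49, 512)          # 7x7 grid of region features from a CNN
word_state = torch.randn(512)           # decoder state while emitting one word

scores = regions @ word_state           # one score per region
weights = F.softmax(scores, dim=0)      # attention distribution over the grid
context = weights @ regions             # attended feature used to predict the word

# Reshaping the weights back to the 7x7 grid gives a heatmap that can be overlaid
# on the image to show which regions drove the word choice.
heatmap = weights.view(7, 7)
print(heatmap.shape)
```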


Language Guided Adversarial Purification

Singh, Himanshu, Subramanyam, A V

arXiv.org Artificial Intelligence

Adversarial purification using generative models demonstrates strong adversarial defense performance. These methods are classifier and attack-agnostic, making them versatile but often computationally intensive. Recent strides in diffusion and score networks have improved image generation and, by extension, adversarial purification. Another highly efficient class of adversarial defense methods known as adversarial training requires specific knowledge of attack vectors, forcing them to be trained extensively on adversarial examples. To overcome these limitations, we introduce a new framework, namely Language Guided Adversarial Purification (LGAP), utilizing pre-trained diffusion models and caption generators to defend against adversarial attacks. Given an input image, our method first generates a caption, which is then used to guide the adversarial purification process through a diffusion network. Our approach has been evaluated against strong adversarial attacks, proving its effectiveness in enhancing adversarial robustness. Our results indicate that LGAP outperforms most existing adversarial defense techniques without requiring specialized network training. This underscores the generalizability of models trained on large datasets, highlighting a promising direction for further research.
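A high-level sketch of the pipeline as described in the abstract, with hypothetical `caption_image` and `purify_with_diffusion` helpers standing in for the pretrained caption generator and the text-guided diffusion model:

```python
# Caption-guided purification, sketched with placeholder components.
def caption_image(image):
    # Placeholder: a real system would call a pretrained caption generator here.
    return "a photo of a dog on grass"

def purify_with_diffusion(image, prompt):
    # Placeholder: a real system would add noise and denoise with a text-conditioned
    # diffusion model, steering the result toward the caption.
    return image

def lgap_style_defend(image, classifier):
    prompt = caption_image(image)                 # 1) describe the (possibly attacked) input
    purified = purify_with_diffusion(image, prompt)  # 2) purify, guided by the caption
    return classifier(purified)                   # 3) classify the purified image
```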


Towards Generating Diverse Audio Captions via Adversarial Training

Mei, Xinhao, Liu, Xubo, Sun, Jianyuan, Plumbley, Mark D., Wang, Wenwu

arXiv.org Artificial Intelligence

Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe the clip diversely from various aspects, using distinct words and grammar. We believe that an audio captioning system should have the ability to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model used to generate captions, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity compared to state-of-the-art methods.
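
The following toy sketch shows the general shape of a generator scored by two discriminators, as described above; the linear modules and the loss combination are placeholders, not the paper's encoder-decoder captioner or hybrid discriminators:

```python
# Generator vs. two discriminators (naturalness and audio-caption relevance).
import torch

gen = torch.nn.Linear(128, 64)              # audio embedding -> caption embedding
d_natural = torch.nn.Linear(64, 1)          # "does this read like a human caption?"
d_semantic = torch.nn.Linear(64 + 128, 1)   # "does the caption match the audio?"

audio = torch.randn(16, 128)
caption = gen(audio)

score_nat = torch.sigmoid(d_natural(caption))
score_sem = torch.sigmoid(d_semantic(torch.cat([caption, audio], dim=-1)))

# Generator tries to fool both discriminators; how to weight the two criteria
# is a design choice in any such setup.
g_loss = -(torch.log(score_nat + 1e-8) + torch.log(score_sem + 1e-8)).mean()
g_loss.backward()
```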


Image Caption Generator using Deep Learning

#artificialintelligence

Our brain is capable of identifying or annotating each image that is shown to us. How can a computer analyse a picture and assign it a caption that is both extremely relevant and accurate? Building a useful caption generator for a picture was formerly thought to be very hard, but thanks to improvements in computer vision and deep learning techniques, the availability of pertinent datasets, and AI models, it is now much simpler. Many data annotation companies are making billions of dollars from caption creation, which is also expanding globally. In this tutorial, we'll show you how to create an annotation tool that can use datasets to provide highly relevant descriptions for images.
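
As a starting point in that spirit, one possible way to caption a single image with an off-the-shelf pretrained model is sketched below; the Hugging Face pipeline task and model name are assumptions about one convenient setup, not the article's own tutorial code:

```python
# Caption one image with a pretrained image-to-text model (assumed setup).
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
result = captioner("example.jpg")          # path or URL of the image to describe
print(result[0]["generated_text"])
```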