Caption Generator


Weakly Supervised Dense Event Captioning in Videos

Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, Junzhou Huang

Neural Information Processing Systems

Among the wide variety of applications in video understanding, the video captioning task has attracted increasing interest in recent years [4, 5, 6, 7, 8, 9, 10, 11].




LangDA: Building Context-Awareness via Language for Domain Adaptive Semantic Segmentation

Liu, Chang, Balaji, Bavesh, Hossain, Saad, Thomas, C, Lai, Kwei-Herng, Vemulapalli, Raviteja, Wong, Alexander, Rambhatla, Sirisha

arXiv.org Machine Learning

Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label-rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision-only approaches using masking or multi-resolution crops, and (2) language-based approaches that use generic class-wise prompts informed by the target domain (e.g., "a {snowy} photo of a {class}"). However, the former is susceptible to noisy pseudo-labels that are biased toward the source domain, while the latter does not fully capture the intricate spatial relationships of objects, which are key for dense prediction tasks. To this end, we propose LangDA. LangDA addresses these challenges by, first, learning contextual relationships between objects via VLM-generated scene descriptions (e.g., "a pedestrian is on the sidewalk, and the street is lined with buildings."). Second, LangDA aligns the features of the entire image with the text representation of this context-aware scene caption and learns generalized representations via text. With this, LangDA sets a new state of the art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4%, and 3.9%.
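
A minimal sketch of the alignment idea described in the abstract, assuming a toy image encoder and a precomputed caption embedding rather than the paper's actual VLM text encoder and segmentation backbone:

```python
# Illustrative image-text alignment, not the authors' exact LangDA objective.
import torch
import torch.nn.functional as F

class ToyImageEncoder(torch.nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, dim, kernel_size=3, padding=1)

    def forward(self, x):
        # Global average pool to a single image-level feature vector.
        return self.conv(x).mean(dim=(2, 3))

def alignment_loss(image_feat, caption_emb):
    # Pull the whole-image feature toward the caption embedding (cosine distance).
    image_feat = F.normalize(image_feat, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    return 1.0 - (image_feat * caption_emb).sum(dim=-1).mean()

encoder = ToyImageEncoder()
images = torch.randn(4, 3, 64, 64)     # target-domain batch (random stand-in)
caption_emb = torch.randn(4, 256)      # embedding of the VLM scene caption (stand-in)
loss = alignment_loss(encoder(images), caption_emb)
loss.backward()
```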


Reviews: Turbo Learning for CaptionBot and DrawingBot

Neural Information Processing Systems

Summary: This paper proposed a joint approach for learning two networks: a captionbot that generates a caption given an image, and a drawingbot that generates an image given a caption. For both the caption and image generators, the authors use existing network architectures. An LSTM-based network that incorporates an image feature produced by a ResNet is used for caption generation (the specific architecture is not clearly described). An attention GAN is used to generate an image from a caption. The main contribution of this paper is the joint training of the caption and image generators by constructing two auto-encoders. An image auto-encoder consists of a caption generator feeding an image generator.
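
To make the auto-encoder construction concrete, here is a toy sketch of the image auto-encoder loop (a caption generator feeding an image generator, trained on reconstruction); both modules are placeholders, not the paper's ResNet+LSTM captioner or attention GAN:

```python
# Toy "image auto-encoder": image -> caption representation -> reconstructed image.
import torch

class ToyCaptionGenerator(torch.nn.Module):      # image -> "caption" embedding
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Flatten(),
                                       torch.nn.Linear(3 * 32 * 32, 128))

    def forward(self, img):
        return self.net(img)

class ToyImageGenerator(torch.nn.Module):        # "caption" embedding -> image
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(128, 3 * 32 * 32)

    def forward(self, cap):
        return self.net(cap).view(-1, 3, 32, 32)

captioner, drawer = ToyCaptionGenerator(), ToyImageGenerator()
images = torch.randn(8, 3, 32, 32)
reconstruction = drawer(captioner(images))
# Joint training signal: how well the caption preserved the image content.
loss = torch.nn.functional.mse_loss(reconstruction, images)
loss.backward()
```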


Infusing Environmental Captions for Long-Form Video Language Grounding

Lee, Hyogun, Hong, Soyeon, Sung, Mujeen, Choi, Jinwoo

arXiv.org Artificial Intelligence

In this work, we tackle the problem of long-form video-language grounding (VLG). Given a long-form video and a natural language query, a model should temporally localize the precise moment that answers the query. Humans can easily solve VLG tasks, even with arbitrarily long videos, by discarding irrelevant moments using extensive and robust knowledge gained from experience. Unlike humans, existing VLG methods are prone to relying on superficial cues learned from small-scale datasets, even when those cues appear in irrelevant frames. To overcome this challenge, we propose EI-VLG, a VLG method that leverages richer textual information provided by a Multi-modal Large Language Model (MLLM) as a proxy for human experience, helping to effectively exclude irrelevant frames. We validate the effectiveness of the proposed method via extensive experiments on the challenging EgoNLQ benchmark.
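
As a rough illustration of excluding irrelevant frames with auxiliary text, the sketch below scores assumed per-frame descriptions (standing in for MLLM output) against the query using simple word overlap; the scoring rule and threshold are illustrative, not the mechanism EI-VLG actually uses:

```python
# Keep only frames whose (assumed) textual description is relevant to the query.
def relevance(description: str, query: str) -> float:
    d, q = set(description.lower().split()), set(query.lower().split())
    return len(d & q) / max(len(q), 1)

frame_captions = {
    0: "a person chops onions on a kitchen counter",
    1: "an empty hallway with a closed door",
    2: "the person places onions into a frying pan",
}
query = "when did I put the onions in the pan"

kept = [t for t, cap in frame_captions.items() if relevance(cap, query) > 0.2]
print(kept)  # frame indices considered relevant to the query
```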


Explainable Image Captioning using CNN- CNN architecture and Hierarchical Attention

Mohan, Rishi Kesav, Sureshkumar, Sanjay, Sivasubramaniam, Vignesh

arXiv.org Artificial Intelligence

Image captioning is a technology that produces text-based descriptions for an image. Deep learning-based solutions built on top of feature recognition may very well serve the purpose. But as with many other machine learning solutions, user understanding of the caption generation process is poor: the model does not provide any explanation for its predictions, and hence conventional methods are also referred to as black-box methods. Thus, an approach in which the model's predictions can be trusted by the user is needed to achieve interpretability. Explainable AI approaches a conventional method in a way that makes the model's or algorithm's predictions explainable and justifiable. Thus, this article approaches image captioning using Explainable AI such that the resulting captions generated by the model can be explained and visualized. A newer architecture with a CNN decoder and a hierarchical attention concept is used to increase the speed and accuracy of caption generation. Also, incorporating explainability into a model makes it more trustworthy when used in an application. The model is trained and evaluated using the MSCOCO dataset, and both quantitative and qualitative results are presented in this article.
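
To illustrate how attention weights can serve as a visual explanation, the following sketch computes a single soft-attention step over a 7x7 grid of region features; it is an assumption-laden toy example, not the paper's CNN-decoder or hierarchical-attention architecture:

```python
# One attention step whose weights can be rendered as an explanation heatmap.
import torch
import torch.nn.functional as F

regions = torch.randn(49, 512)          # 7x7 grid of region features from a CNN
word_state = torch.randn(512)           # decoder state while emitting one word

scores = regions @ word_state           # one score per region
weights = F.softmax(scores, dim=0)      # attention distribution over the grid
context = weights @ regions             # attended feature used to predict the word

# Reshaping the weights back to the 7x7 grid gives a heatmap that can be overlaid
# on the image to show which regions drove the word choice.
heatmap = weights.view(7, 7)
print(heatmap.shape)
```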


Language Guided Adversarial Purification

Singh, Himanshu, Subramanyam, A V

arXiv.org Artificial Intelligence

Adversarial purification using generative models demonstrates strong adversarial defense performance. These methods are classifier and attack-agnostic, making them versatile but often computationally intensive. Recent strides in diffusion and score networks have improved image generation and, by extension, adversarial purification. Another highly efficient class of adversarial defense methods known as adversarial training requires specific knowledge of attack vectors, forcing them to be trained extensively on adversarial examples. To overcome these limitations, we introduce a new framework, namely Language Guided Adversarial Purification (LGAP), utilizing pre-trained diffusion models and caption generators to defend against adversarial attacks. Given an input image, our method first generates a caption, which is then used to guide the adversarial purification process through a diffusion network. Our approach has been evaluated against strong adversarial attacks, proving its effectiveness in enhancing adversarial robustness. Our results indicate that LGAP outperforms most existing adversarial defense techniques without requiring specialized network training. This underscores the generalizability of models trained on large datasets, highlighting a promising direction for further research.
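A high-level sketch of the pipeline as described in the abstract, with hypothetical `caption_image` and `purify_with_diffusion` helpers standing in for the pretrained caption generator and the text-guided diffusion model:

```python
# Caption-guided purification, sketched with placeholder components.
def caption_image(image):
    # Placeholder: a real system would call a pretrained caption generator here.
    return "a photo of a dog on grass"

def purify_with_diffusion(image, prompt):
    # Placeholder: a real system would add noise and denoise with a text-conditioned
    # diffusion model, steering the result toward the caption.
    return image

def lgap_style_defend(image, classifier):
    prompt = caption_image(image)                 # 1) describe the (possibly attacked) input
    purified = purify_with_diffusion(image, prompt)  # 2) purify, guided by the caption
    return classifier(purified)                   # 3) classify the purified image
```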


Towards Generating Diverse Audio Captions via Adversarial Training

Mei, Xinhao, Liu, Xubo, Sun, Jianyuan, Plumbley, Mark D., Wang, Wenwu

arXiv.org Artificial Intelligence

Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe the clip diversely from various aspects, using distinct words and grammar. We believe that an audio captioning system should have the ability to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model used to generate captions, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity compared to state-of-the-art methods.
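
The following toy sketch shows the general shape of a generator scored by two discriminators, as described above; the linear modules and the loss combination are placeholders, not the paper's encoder-decoder captioner or hybrid discriminators:

```python
# Generator vs. two discriminators (naturalness and audio-caption relevance).
import torch

gen = torch.nn.Linear(128, 64)              # audio embedding -> caption embedding
d_natural = torch.nn.Linear(64, 1)          # "does this read like a human caption?"
d_semantic = torch.nn.Linear(64 + 128, 1)   # "does the caption match the audio?"

audio = torch.randn(16, 128)
caption = gen(audio)

score_nat = torch.sigmoid(d_natural(caption))
score_sem = torch.sigmoid(d_semantic(torch.cat([caption, audio], dim=-1)))

# Generator tries to fool both discriminators; how to weight the two criteria
# is a design choice in any such setup.
g_loss = -(torch.log(score_nat + 1e-8) + torch.log(score_sem + 1e-8)).mean()
g_loss.backward()
```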


Image Caption Generator using Deep Learning

#artificialintelligence

Our brain is capable of identifying or annotating each image that is shown to us. How can a computer analyse a picture and assign it a caption that is both extremely relevant and accurate? Building a useful caption generator for a picture was formerly thought to be very hard, but thanks to improvements in computer vision and deep learning techniques, the availability of pertinent datasets, and AI models, it is now much simpler. Many data annotation companies are making billions of dollars from caption creation, which is also expanding globally. In this tutorial, we'll show you how to create an annotation tool that can use datasets to provide highly relevant descriptions for images.
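
As a starting point in that spirit, one possible way to caption a single image with an off-the-shelf pretrained model is sketched below; the Hugging Face pipeline task and model name are assumptions about one convenient setup, not the article's own tutorial code:

```python
# Caption one image with a pretrained image-to-text model (assumed setup).
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
result = captioner("example.jpg")          # path or URL of the image to describe
print(result[0]["generated_text"])
```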