Variational Structured Semantic Inference for Diverse Image Captioning

Fuhai Chen, Rongrong Ji, Jiayi Ji, Xiaoshuai Sun, Baochang Zhang, Xuri Ge, Yongjian Wu, Feiyue Huang, Yan Wang

Oct-3-2025, 07:36:58 GMT–Neural Information Processing Systems

Despite the exciting progress in image captioning, generating diverse captions for a given image remains an open problem. Existing methods typically apply generative models such as the Variational Auto-Encoder to diversify the captions, but they neglect two key factors of diverse expression, namely lexical diversity and syntactic diversity. To model these two inherent diversities in image captioning, we propose a Variational Structured Semantic Inferring model (termed VSSI-cap) executed in a novel structured encoder-inferer-decoder schema. The main innovation of VSSI-cap is a novel structure, the Variational Multi-modal Inferring tree (termed VarMI-tree). In particular, conditioned on the visual-textual features from the encoder, the VarMI-tree models the lexical and syntactic diversities by inferring their latent variables (with variations) in an approximate posterior inference guided by a visual semantic prior. Then, a reconstruction loss and the posterior-prior KL-divergence are jointly estimated to optimize the VSSI-cap model. Finally, diverse captions are generated from the visual features and the latent variables of this structured encoder-inferer-decoder model. Experiments on the benchmark dataset show that the proposed VSSI-cap achieves significant improvements over the state of the art.
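
The objective named in the abstract, a reconstruction loss combined with a posterior-prior KL-divergence, can be illustrated with a minimal sketch. This is not the VSSI-cap implementation: it assumes a single flat latent vector with diagonal Gaussian posterior and prior (the paper infers latent variables over a tree structure), and the function names `gaussian_kl`, `reconstruction_nll`, and `cvae_loss`, as well as the `kl_weight` parameter, are hypothetical names chosen for illustration.

```python
import math

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians.

    Assumption: posterior q and prior p are both diagonal Gaussians,
    each given as a list of means and a list of log-variances.
    """
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Closed-form KL for one dimension, summed over dimensions.
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

def reconstruction_nll(token_probs):
    """Negative log-likelihood of the reference caption.

    token_probs holds the probability the decoder assigned to each
    ground-truth token (hypothetical decoder output).
    """
    return -sum(math.log(p) for p in token_probs)

def cvae_loss(token_probs, mu_q, logvar_q, mu_p, logvar_p, kl_weight=1.0):
    """Reconstruction term plus weighted posterior-prior KL term."""
    return reconstruction_nll(token_probs) + kl_weight * gaussian_kl(
        mu_q, logvar_q, mu_p, logvar_p
    )

if __name__ == "__main__":
    # Toy numbers only: a 3-token caption and a 2-dimensional latent.
    loss = cvae_loss(
        token_probs=[0.9, 0.8, 0.7],
        mu_q=[0.2, -0.1], logvar_q=[-0.5, -0.3],
        mu_p=[0.0, 0.0], logvar_p=[0.0, 0.0],
    )
    print(f"loss = {loss:.4f}")
```

In this sketch the prior parameters `mu_p` and `logvar_p` stand in for the visual semantic prior, and the posterior parameters stand in for the output of the inference step; minimizing the sum trades caption fidelity against keeping the posterior close to the prior.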

caption, diversity, syntactic diversity

Country:
- North America > Canada (0.04)
- Asia > China
  - Fujian Province > Xiamen (0.04)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.70)
